Abstract

The problem of detecting the sparsity pattern of a $k$-sparse vector in $\mathbb{R}^n$ from $m$ random noisy measurements is
of interest in many areas such as system identification, denoising, pattern recognition, and compressed sensing. This
paper addresses the scaling of the number of measurements $m$, with signal dimension $n$ and sparsity level $k$ (the
number of nonzeros), for asymptotically reliable detection. We show that a necessary condition for perfect recovery at
any given SNR for all algorithms, regardless of complexity, is $m = \Omega(k \log(n - k))$ measurements. Conversely, it is
shown that this scaling of $\Omega(k \log(n - k))$ measurements is sufficient for a remarkably simple "maximum correlation"
estimator. Hence this scaling is optimal and does not require more sophisticated techniques such as lasso or matching pursuit.
The constants for both the necessary and sufficient conditions are precisely defined in terms of the minimum-to-average
ratio of the nonzero components and the SNR. The necessary condition improves upon previous results
for maximum likelihood estimation. For lasso, it also provides a necessary condition at any SNR and for low SNR
improves upon previous work. The sufficient condition provides the first asymptotically reliable detection guarantee
at finite SNR.
Suppose one is given an observation $y \in \mathbb{R}^m$ that was generated through $y = Ax + d$, where $A \in \mathbb{R}^{m \times n}$
is known and $d \in \mathbb{R}^m$ is an additive noise vector with a known distribution. It may be desirable for an
estimate of $x$ to have a small number of nonzero components. An intuitive example is when one wants
to choose a small subset from a large number of possibly-related factors that linearly influence a vector 
of observed data. Each factor corresponds to a column of A, and one wishes to find a small subset of 
columns with which to form a linear combination that closely matches the observed data y. This is the 
subset selection problem in (linear) regression [1], and it gives no reason to penalize large values for the 
nonzero components. 

In this paper, we assume that the true signal x has k nonzero entries and that k is known when estimating 
$x$ from $y$. We are concerned with establishing necessary and sufficient conditions for the recovery of the
positions of the nonzero entries of $x$, which we call the sparsity pattern. Once the sparsity pattern is
correct, $n - k$ columns of $A$ can be ignored and the stability of the solution is well understood; however,
we do not study any other performance criterion. 
A. Previous Work

Sparsity pattern recovery (or more simply, sparsity recovery) has received considerable attention in a 
variety of guises. Most transparent from our formulation is the connection to sparse approximation. In a
typical sparse approximation problem, one is given data $y \in \mathbb{R}^m$, dictionary* $A \in \mathbb{R}^{m \times n}$, and tolerance
$\epsilon > 0$. The aim is to find $x$ with the fewest nonzero entries among those satisfying $\|Ax - y\| < \epsilon$.
This problem is NP-hard [3] but greedy heuristics (matching pursuit [2] and its variants) and convex 
relaxations (basis pursuit [4], lasso [5] and others) can be effective under certain conditions on A and 
y [6]-[8]. Scaling laws for sparsity recovery with any A were first given in [9]. 

More recently, the concept of "sensing" sparse x through multiplication by a suitable random matrix A, 
with measurement error d, has been termed compressed sensing [10]-[12]. This has popularized the study 
of sparse approximation with respect to random dictionaries, which was considered also in [13]. Results 
are generally phrased as the asymptotic scaling of the number of measurements m (the length of y) needed 
for sparsity recovery to succeed with high probability, as a function of the other problem parameters. More 
specifically, most results are sufficient conditions for specific tractable recovery algorithms to succeed. For 
example, if $A$ has i.i.d. Gaussian entries and $d = 0$, then $m \asymp 2k \log(n/k)$ dictates the minimum scaling
at which basis pursuit succeeds with high probability [14]. With nonzero noise variance, necessary and
sufficient conditions for the success of lasso in this setting have the asymptotic scaling [15]

$$ m \asymp 2k \log(n - k) + k + 1. \tag{1} $$

To understand the ultimate limits of sparsity recovery, while also casting light on the efficacy of lasso 
or orthogonal matching pursuit (OMP), it is of interest to determine necessary and sufficient conditions 
for an optimal recovery algorithm to succeed. Of course, since it is sufficient for lasso, the condition
(1) is sufficient for an optimal algorithm. Is it close to a necessary condition? We address precisely this
question by proving a necessary condition that differs from (1) by a factor that is constant with respect
to $n$ and $k$ while depending on the signal-to-noise ratio (SNR) and minimum-to-average ratio (MAR), which
will be defined precisely in Section II. Furthermore, we present an extremely simple algorithm for which
a sufficient condition for sparsity recovery is similarly within a constant factor of (1).

Previous necessary conditions had been based on information-theoretic analyses such as the capacity 
arguments in [16], [17] and a use of Fano's inequality in [18]. More recent publications with necessary 
conditions include [19]-[22]. As described in Section III, our new necessary conditions are stronger than
the previous results. 

Table I previews our main results and places (1) in context. The measurement model and parameters
MAR and SNR are defined in the following section. Arbitrarily small constants have been omitted, and
the last column, labeled simply $\mathrm{SNR} \to \infty$, is more specifically for $\mathrm{MAR} > \epsilon > 0$ for some fixed $\epsilon$ and
$\mathrm{SNR} = \Omega(k)$.



B. Paper Organization 

The setting is formalized in Section II. In particular, we define our concepts of signal-to-noise ratio and
minimum-to-average ratio; our results clarify the roles of these quantities in the sparsity recovery problem.
Necessary conditions for success of any algorithm are considered in Section III. There we present a new
necessary condition and compare it to previous results and numerical experiments. Section IV introduces
a very simple recovery algorithm for the purpose of showing that a sufficient condition for its success is
rather weak: it has the same dependence on $n$ and $k$ as (1). Conclusions are given in Section V, and
proofs appear in the Appendix.

*The term seems to have originated in [2] and may apply to A or the columns of A as a set. 
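                                     | finite SNR                      | SNR → ∞
-------------------------------------+---------------------------------+------------------------------
Any algorithm must fail              | Theorem 1                       | m < k (elementary)
-------------------------------------+---------------------------------+------------------------------
Necessary and sufficient for lasso   | unknown (expressions above and  | m ≍ 2k log(n − k) + k + 1,
                                     | right are necessary)            | Wainwright [15]
-------------------------------------+---------------------------------+------------------------------
Sufficient for maximum correlation   | Theorem 2                       | m > (8/MAR) k log(n − k),
estimator (8)                        |                                 | from Theorem 2

TABLE I

Summary of Results on Measurement Scaling for Reliable Sparsity Recovery

(see body for definitions and technical limitations)
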



II. Problem Statement 
Consider estimating a $k$-sparse vector $x \in \mathbb{R}^n$ through a vector of observations,

$$ y = Ax + d, \tag{2} $$

where $A \in \mathbb{R}^{m \times n}$ is a random matrix with i.i.d. $\mathcal{N}(0, 1/m)$ entries and $d \in \mathbb{R}^m$ is i.i.d. unit-variance
Gaussian noise. Denote the sparsity pattern of $x$ (positions of nonzero entries) by the set $I_{\rm true}$, which is
a $k$-element subset of the set of indices $\{1, 2, \ldots, n\}$. Estimates of the sparsity pattern will be denoted
by $\hat{I}$ with subscripts indicating the type of estimator. We seek conditions under which there exists an
estimator such that $\hat{I} = I_{\rm true}$ with high probability.

In addition to the signal dimensions, m, n and k, we will show that there are two variables that dictate 
the ability to detect the sparsity pattern reliably: the SNR, and what we will call the minimum-to-average 
ratio. 

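The SNR is defined by

$$ \mathrm{SNR} = \frac{E\left[\|Ax\|^2\right]}{E\left[\|d\|^2\right]}. \tag{3} $$

Since we are considering $x$ as an unknown deterministic vector, the SNR can be further simplified as
follows: The entries of $A$ are i.i.d. $\mathcal{N}(0, 1/m)$, so columns $a_i \in \mathbb{R}^m$ and $a_j \in \mathbb{R}^m$ of $A$ satisfy $E[a_i' a_j] = \delta_{ij}$.
Therefore, the signal energy is given by

$$ E\left[\|Ax\|^2\right] = E\Big[\Big\|\sum_{j \in I_{\rm true}} a_j x_j\Big\|^2\Big] = \sum_{i,j \in I_{\rm true}} E[a_i' a_j]\, x_i x_j = \sum_{i,j \in I_{\rm true}} \delta_{ij}\, x_i x_j = \|x\|^2. $$

Substituting into the definition (3), and using $E[\|d\|^2] = m$ for unit-variance noise, the SNR is given by

$$ \mathrm{SNR} = \frac{1}{m}\|x\|^2. \tag{4} $$

The minimum-to-average ratio of $x$ is defined as

$$ \mathrm{MAR} = \frac{\min_{j \in I_{\rm true}} |x_j|^2}{\|x\|^2 / k}. \tag{5} $$

Since $\|x\|^2/k$ is the average of $\{|x_j|^2 \mid j \in I_{\rm true}\}$, $\mathrm{MAR} \in (0, 1]$ with the upper limit occurring when all
the nonzero entries of $x$ have the same magnitude.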
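
The definitions (2), (4), and (5) can be made concrete with a small numerical sketch. The Python/NumPy fragment below is illustrative only: the function names `snr_of`, `mar_of`, and `make_instance` are hypothetical (they do not come from the paper), and the signal construction, one small nonzero entry with $k - 1$ equal larger ones, is just one convenient way to reach a prescribed MAR.

```python
import numpy as np

def snr_of(x, m):
    """SNR as in (4): ||x||^2 / m, for unit-variance noise."""
    return float(np.sum(x ** 2) / m)

def mar_of(x):
    """MAR as in (5): smallest nonzero |x_j|^2 over the average nonzero |x_j|^2."""
    nz_sq = x[x != 0] ** 2
    return float(nz_sq.min() / nz_sq.mean())

def make_instance(m, n, k, snr, mar, rng):
    """Draw (A, x, y, support) from model (2) with prescribed SNR and MAR.

    Assumption of this sketch: one small nonzero entry and k - 1 equal,
    larger ones. Requires 0 < mar <= 1, and mar == 1 when k == 1.
    """
    support = np.sort(rng.choice(n, size=k, replace=False))
    energy = m * snr                 # target ||x||^2, from (4)
    small_sq = mar * energy / k      # target min |x_j|^2, from (5)
    x = np.zeros(n)
    if k > 1:
        x[support] = np.sqrt((energy - small_sq) / (k - 1))
    x[support[0]] = np.sqrt(small_sq)
    A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))  # i.i.d. N(0, 1/m) entries
    d = rng.normal(0.0, 1.0, size=m)                    # unit-variance noise
    return A, x, A @ x + d, support
```

Calling `make_instance(50, 20, 4, 10.0, 0.5, np.random.default_rng(0))` and then evaluating `snr_of` and `mar_of` on the returned `x` should reproduce the requested values up to floating-point error.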
Remarks: Other works use a variety of normalizations, e.g.: the entries of $A$ have variance $1/n$ in [12],
[20]; the entries of $A$ have unit variance and the variance of $d$ is a variable $\sigma^2$ in [15], [18], [21], [22];
and our scaling of $A$ and a noise variance of $\sigma^2$ are used in [23]. This necessitates great care in comparing
results.

Some results involve

$$ \mathrm{MAR} \cdot \mathrm{SNR} = \frac{k}{m} \min_{j \in I_{\rm true}} |x_j|^2. $$

While a similar quantity affects a regularization weight sequence in [15], there it does not affect the
number of measurements required for the success of lasso. The magnitude of the smallest nonzero entry
of $x$ is also prominent in the phrasing of results in [21], [22].

III. Necessary Condition for Sparsity Recovery

We first consider sparsity recovery without being concerned with computational complexity of the 
estimation algorithm. Since the vector $x \in \mathbb{R}^n$ is $k$-sparse, the vector $Ax$ belongs to one of $L = \binom{n}{k}$
subspaces spanned by k of the n columns of A. Estimation of the sparsity pattern is the selection of one 
of these subspaces, and since the noise d is Gaussian, the probability of error is minimized by choosing 
the subspace closest to the observed vector y. This results in the maximum likelihood (ML) estimate. 

Mathematically, the ML estimator can be described as follows. Given a subset $J \subset \{1, 2, \ldots, n\}$, let
$P_J y$ denote the orthogonal projection of the vector $y$ onto the subspace spanned by the vectors $\{a_j \mid j \in J\}$.
The ML estimate of the sparsity pattern is

$$ \hat{I}_{\rm ML} = \arg\max_{J \,:\, |J| = k} \|P_J y\|^2, $$

where $|J|$ denotes the cardinality of $J$. That is, the ML estimate is the set of $k$ indices such that the
subspace spanned by the corresponding columns of $A$ contains the maximum signal energy of $y$.

Since the number of subspaces, $L$, grows exponentially in $n$ and $k$, an exhaustive search is computationally
infeasible. However, the performance of ML estimation provides a lower bound on the number of
measurements needed by any algorithm that cannot exploit a priori information on $x$ other than it being
$k$-sparse.
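
For very small problems the exhaustive search can still be written down directly, which may help fix ideas. The following is a minimal sketch, assuming NumPy; the function name `ml_support` is hypothetical, and the projection energy $\|P_J y\|^2$ is computed through a least-squares fit rather than by forming $P_J$ explicitly.

```python
import itertools
import numpy as np

def ml_support(A, y, k):
    """Exhaustive ML estimate: the k-subset J maximizing ||P_J y||^2.

    The loop visits all C(n, k) subsets, so this is only usable for tiny n.
    """
    n = A.shape[1]
    best_energy, best_J = -np.inf, None
    for J in itertools.combinations(range(n), k):
        cols = A[:, list(J)]
        # Least-squares fit of y on the columns in J; cols @ coef equals P_J y.
        coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
        energy = float(np.sum((cols @ coef) ** 2))
        if energy > best_energy:
            best_energy, best_J = energy, J
    return np.array(best_J)
```

The cost grows as $\binom{n}{k}$ least-squares solves, which is the infeasibility referred to above.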

Theorem 1: Let $k = k(n)$ and $m = m(n)$ vary with $n$ such that $\lim_{n \to \infty} k(n) = \infty$ and

$$ m(n) < \frac{2 - \delta}{\mathrm{MAR} \cdot \mathrm{SNR}}\, k \log(n - k) + k - 1 \tag{6} $$

for some $\delta > 0$. Then even the ML estimator asymptotically cannot detect the sparsity pattern, i.e.,

$$ \lim_{n \to \infty} \Pr\left(\hat{I}_{\rm ML} = I_{\rm true}\right) = 0. $$

Proof: See Appendix B. ■

The theorem shows that for fixed SNR and MAR, the scaling $m = \Omega(k \log(n - k))$ is necessary for
reliable sparsity pattern recovery. The next section will show that this scaling can be achieved with an 
extremely simple method. 
Remarks: 

1) The theorem applies for any $k(n)$ such that $\lim_{n \to \infty} k(n) = \infty$, including both cases with $k = o(n)$
and $k = \Theta(n)$. In particular, under linear sparsity ($k = \alpha n$ for some constant $\alpha$), the theorem shows
that

$$ m \asymp \frac{2\alpha}{\mathrm{MAR} \cdot \mathrm{SNR}}\, n \log n $$

measurements are necessary for sparsity recovery. Similarly, if $m/n$ is bounded above by a constant,
then sparsity recovery will certainly fail unless $k = O(n / \log n)$.

The formulation of [15] makes $\mathrm{SNR} = \Theta(n)$, which obscures the effect of the noise level. See also the second remark following
Theorem 2.
2) In the case of $\mathrm{MAR} \cdot \mathrm{SNR} < 1$, the bound (6) improves upon the necessary condition of [15] for the
asymptotic success of lasso by the factor $(\mathrm{MAR} \cdot \mathrm{SNR})^{-1}$.

3) The bound (6) can be compared against the information-theoretic bounds mentioned earlier. The
tightest of these bounds is in [17] and shows that the problem dimensions must satisfy

$$ \frac{2}{m} \log_2 \binom{n}{k} \le \log_2(1 + \mathrm{SNR}) - \alpha \log_2\!\left(1 + \frac{\mathrm{SNR}}{\alpha}\right), \tag{7} $$

where $\alpha = k/n$ is the sparsity ratio. For large $n$ and $k$, the bound can be rearranged as

$$ m \ge \frac{2h(\alpha)}{\alpha} \left[ \log_2(1 + \mathrm{SNR}) - \alpha \log_2\!\left(1 + \frac{\mathrm{SNR}}{\alpha}\right) \right]^{-1} k, $$

where $h(\cdot)$ is the binary entropy function. In particular, when the sparsity ratio $\alpha$ is fixed, the bound
shows only that $m$ needs to grow at least linearly with $k$. In contrast, Theorem 1 shows that with
fixed sparsity ratio $m = \Omega(k \log(n - k))$ is necessary for reliable sparsity recovery. Thus, the bound
in Theorem 1 is significantly tighter and reveals that the previous information-theoretic necessary
conditions from [16]-[18], [21], [22] are overly optimistic.

4) Results more similar to Theorem 1, based on direct analyses of error events rather than
information-theoretic arguments, appeared in [19], [20]. The previous results showed that with fixed SNR as
defined here, sparsity recovery with $m = O(k)$ must fail. The more refined analysis in this paper
gives the additional $\log(n - k)$ factor and the precise dependence on $\mathrm{MAR} \cdot \mathrm{SNR}$.

5) Theorem 1 is not contradicted by the relevant sufficient condition of [21], [22]. That sufficient
condition holds for scaling that gives linear sparsity and $\mathrm{MAR} \cdot \mathrm{SNR} = \Omega(\sqrt{n \log n})$. For $\mathrm{MAR} \cdot \mathrm{SNR} =
\sqrt{n \log n}$, Theorem 1 shows that fewer than $m \asymp 2\sqrt{k \log k}$ measurements will cause ML decoding
to fail, while [22, Thm. 3.1] shows that a typicality-based decoder will succeed with $m = O(k)$
measurements.

6) Note that the necessary condition of [18] is proven for $\mathrm{MAR} = 1$. Theorem 1 gives a bound that
increases for smaller MAR; this suggests (though does not prove, since the condition is merely 
necessary) that smaller MAR makes the problem harder. 

Numerical validation: Computational confirmation of Theorem 1 is technically impossible, and even
qualitative support is hard to obtain because of the high complexity of ML detection. Nevertheless, we 
may obtain some evidence through Monte Carlo simulation. 

Fig. 1 shows the probability of success of ML detection for $n = 20$ as $k$, $m$, SNR, and MAR are varied,
with each point representing at least 500 independent trials. Each subpanel gives simulation results for
$k \in \{1, 2, \ldots, 5\}$ and $m \in \{1, 2, \ldots, 40\}$ for one (SNR, MAR) pair. Signals with $\mathrm{MAR} < 1$ are created
by having one small nonzero component and $k - 1$ equal, larger nonzero components. Overlaid on the
color-intensity plots is a black curve representing (6).
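
The overlaid curve is simply the right-hand side of (6) evaluated as a function of $k$. A minimal sketch follows; the function name `ml_failure_threshold` is hypothetical, and $\delta$ is taken to zero for plotting purposes, which is an assumption of this example rather than something stated in the text.

```python
import math

def ml_failure_threshold(n, k, snr, mar):
    """Right-hand side of (6) with delta -> 0.

    Theorem 1 is asymptotic, so at small n this is only a guide to where
    ML detection starts to fail, not an exact boundary.
    """
    return 2.0 / (mar * snr) * k * math.log(n - k) + k - 1

# Curve for the n = 20 setting and one illustrative (SNR, MAR) pair.
curve = [ml_failure_threshold(20, k, snr=10.0, mar=1.0) for k in range(1, 6)]
```

As the SNR grows, the first term vanishes and the threshold approaches $k - 1$.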

Taking any one column of one subpanel from bottom to top shows that as $m$ is increased, there
is a transition from ML failing to ML succeeding. One can see that (6) follows the failure-success
transition qualitatively. In particular, the empirical dependence on SNR and MAR approximately follows
(6). Empirically, for the (small) value of $n = 20$, it seems that with $\mathrm{MAR} \cdot \mathrm{SNR}$ held fixed, sparsity recovery
becomes easier as SNR increases (and MAR decreases).

Less extensive Monte Carlo simulations for $n = 40$ are reported in Fig. 2. The results are qualitatively
similar. As might be expected, the transition from low to high probability of successful recovery as a 
function of m appears more sharp at n = 40 than at n = 20. 

IV. Sufficient Condition with Maximum Correlation Detection 

Consider the following simple estimator. As before, let $a_j$ be the $j$th column of the random matrix $A$.
Define the maximum correlation (MC) estimate as

$$ \hat{I}_{\rm MC} = \left\{ j \,:\, |a_j' y| \text{ is one of the } k \text{ largest values of } |a_i' y| \right\}. \tag{8} $$
Fig. 1. Simulated success probability of ML detection for $n = 20$ and many values of $k$, $m$, SNR, and MAR. Each subfigure gives
simulation results for $k \in \{1, 2, \ldots, 5\}$ and $m \in \{1, 2, \ldots, 40\}$ for one (SNR, MAR) pair. Each subfigure heading gives (SNR, MAR).
Each point represents at least 500 independent trials. Overlaid on the color-intensity plots is a black curve representing (6).

Fig. 2. Simulated success probability of ML detection for $n = 40$; SNR = 10; MAR = 1 (left) or MAR = 0.5 (right); and many values of
$k$ and $m$. Each subfigure gives simulation results for $k \in \{1, 2, \ldots, 5\}$ and $m \in \{1, 2, \ldots, 40\}$, with each point representing at least 1000
independent trials. Overlaid on the color-intensity plots (with scale as in Fig. 1) is a black curve representing (6).



This algorithm simply correlates the observed signal y with all the frame vectors aj and selects the indices 

j with the highest correlation. It is significantly simpler than both lasso and matching pursuit and is not 

meant to be proposed as a competitive alternative. Rather, the MC method is introduced and analyzed to 

illustrate that a trivial method can obtain optimal scaling with respect to n and k. 
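
The estimator (8) amounts to one matrix-vector product followed by a top-$k$ selection. A minimal NumPy sketch, with the hypothetical name `mc_support`, might look as follows; ties among correlation values are broken arbitrarily here, which the text does not address.

```python
import numpy as np

def mc_support(A, y, k):
    """Maximum correlation estimate (8): indices of the k largest |a_j' y|."""
    corr = np.abs(A.T @ y)                # |a_j' y| for every column j
    top_k = np.argpartition(corr, -k)[-k:]  # unordered indices of the k largest
    return np.sort(top_k)
```

For example, with `A = np.eye(5)` and `y = np.array([0.1, 3.0, -4.0, 0.2, 0.0])`, the call `mc_support(A, y, 2)` selects indices 1 and 2.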

Theorem 2: Let $k = k(n)$ and $m = m(n)$ vary with $n$ such that $\lim_{n \to \infty} k = \infty$, $\limsup_{n \to \infty} k/n <
1/2$, and

$$ m > \frac{(8 + \delta)(1 + \mathrm{SNR})}{\mathrm{MAR} \cdot \mathrm{SNR}}\, k \log(n - k) \tag{9} $$

for some $\delta > 0$. Then the maximum correlation estimator asymptotically detects the sparsity pattern, i.e.,

$$ \lim_{n \to \infty} \Pr\left(\hat{I}_{\rm MC} = I_{\rm true}\right) = 1. $$

Proof: See Appendix C. ■
Remarks:

1) Comparing (6) and (9), we see that for a fixed SNR and minimum-to-average ratio, the simple
MC estimator needs only a constant factor more measurements than the optimal ML estimator.
In particular, the results show that the scaling of the minimum number of measurements $m =
\Omega(k \log(n - k))$ is both necessary and sufficient. Moreover, the optimal scaling not only does
not require ML estimation, it does not even require lasso or matching pursuit: it can be achieved
with a remarkably simple method such as maximum correlation.

There is, of course, a difference in the constant factors of the expressions (6) and (9). Specifically,
the MC method requires a factor $4(1 + \mathrm{SNR})$ more measurements than ML detection. In particular,
for low SNRs (i.e. $\mathrm{SNR} \ll 1$), the factor reduces to approximately 4.

2) For high SNRs, the gap between the MC estimator and the ML estimator can be large. In particular,
the lower bound on the number of measurements required by ML decreases to $k - 1$ as $\mathrm{SNR} \to \infty$.
In contrast, with the MC estimator increasing the SNR has diminishing returns: as $\mathrm{SNR} \to \infty$, the
bound on the number of measurements in (9) approaches

$$ m > \frac{8}{\mathrm{MAR}}\, k \log(n - k). \tag{10} $$

Thus, even with $\mathrm{SNR} \to \infty$, the minimum number of measurements is not improved from $m =
O(k \log(n - k))$.

These diminishing returns for improved SNR exhibited by the MC method are also a problem for
more sophisticated methods such as lasso. For example, the analysis of [15] shows that when the
$\mathrm{SNR} = \Theta(n)$ (so $\mathrm{SNR} \to \infty$) and MAR is bounded strictly away from zero, lasso requires

$$ m > 2k \log(n - k) + k + 1 \tag{11} $$

for reliable recovery. Therefore, like the MC method, lasso does not achieve a scaling better than
$m = O(k \log(n - k))$, even at infinite SNR.

3) There is certainly a gap between MC and lasso. Comparing (10) and (11), we see that, at high
SNR, the simple MC method requires a factor of at most 4/MAR more measurements than lasso. 
This factor is largest when MAR is small, which occurs when there are relatively small non-zero 
components. Thus, in the high SNR regime, the main benefit of lasso is not that it achieves an 
optimal scaling with respect to k and n (which can be achieved with the simpler MC), but rather 
that lasso is able to detect small coefficients, even when they are much below the average power. 
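
The constant-factor gaps discussed in these remarks follow from simple arithmetic on (6), (9), (10), and (11). The sketch below evaluates them; all function names are hypothetical, $\delta$ is taken to zero, and the $+\,k - 1$ offset in (6) is dropped so that only leading terms are compared.

```python
import math

def ml_necessary_leading(n, k, snr, mar):
    """Leading term of (6) with delta -> 0 (the + k - 1 offset is omitted)."""
    return 2.0 / (mar * snr) * k * math.log(n - k)

def mc_sufficient(n, k, snr, mar):
    """Right-hand side of (9) with delta -> 0."""
    return 8.0 * (1.0 + snr) / (mar * snr) * k * math.log(n - k)

def mc_high_snr(n, k, mar):
    """Right-hand side of (10): the SNR -> infinity limit of (9)."""
    return 8.0 / mar * k * math.log(n - k)

def lasso_high_snr(n, k):
    """Right-hand side of (11), the lasso scaling reported in [15]."""
    return 2.0 * k * math.log(n - k) + k + 1

n, k, snr, mar = 1000, 20, 3.0, 0.5
gap_mc_vs_ml = mc_sufficient(n, k, snr, mar) / ml_necessary_leading(n, k, snr, mar)
gap_mc_vs_lasso = mc_high_snr(n, k, mar) / lasso_high_snr(n, k)
```

With these illustrative parameters, `gap_mc_vs_ml` equals $4(1 + \mathrm{SNR})$ and `gap_mc_vs_lasso` stays below $4/\mathrm{MAR}$, in line with remarks 1 and 3.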

Numerical validation: MC sparsity pattern detection is extremely simple and can thus be simulated
easily for large problem sizes. Fig. 3 reports the results of a large number of Monte Carlo simulations of the
MC method with $n = 100$. The threshold predicted by (9) matches well to the parameter combinations
where the probability of success drops below about 0.995.
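
A simulation of this kind can be sketched in a few lines. The code below is not the authors' simulation code; it is a minimal NumPy example under stated assumptions: equal-magnitude nonzeros (so $\mathrm{MAR} = 1$), a hypothetical function name `mc_success_rate`, and a trial count chosen only for speed.

```python
import numpy as np

def mc_success_rate(m, n, k, snr, trials, rng):
    """Fraction of trials in which maximum correlation recovers the support.

    Assumes equal-magnitude nonzero entries, i.e. MAR = 1.
    """
    hits = 0
    for _ in range(trials):
        support = rng.choice(n, size=k, replace=False)
        x = np.zeros(n)
        x[support] = np.sqrt(m * snr / k)   # ||x||^2 = m * SNR, as in (4)
        A = rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n))
        y = A @ x + rng.normal(0.0, 1.0, size=m)
        est = np.argpartition(np.abs(A.T @ y), -k)[-k:]
        hits += set(est.tolist()) == set(support.tolist())
    return hits / trials

rng = np.random.default_rng(0)
rate_many = mc_success_rate(2000, 50, 3, 10.0, 50, rng)  # m far above (9)
rate_few = mc_success_rate(5, 50, 3, 10.0, 50, rng)      # m far below (9)
```

Sweeping `m` and `k` over a grid and plotting the resulting rates would give a picture in the spirit of Fig. 3, though with far fewer trials per point.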

V. Conclusions 

We have considered the problem of detecting the sparsity pattern of a sparse vector from noisy random 
linear measurements. Our main conclusions are: 

• Necessary and sufficient scaling with respect to $n$ and $k$. For a given SNR and minimum-to-average
ratio, the scaling of the number of measurements

$$ m = O(k \log(n - k)) $$

is both necessary and sufficient for asymptotically reliable sparsity pattern detection. This scaling is
significantly worse than predicted by previous information-theoretic bounds.

Of course, at least $k + 1$ measurements are necessary.
Fig. 3. Simulated success probability of MC detection for $n = 100$ and many values of $k$, $m$, SNR, and MAR. Each subfigure gives
simulation results for $k \in \{1, 2, \ldots, 20\}$ and $m \in \{25, 50, \ldots, 1000\}$ for one (SNR, MAR) pair. Each subfigure heading gives (SNR, MAR),
so SNR = 1, 10, 100 for the three columns and MAR = 1, 0.5, 0.2 for the three rows. Each point represents 1000 independent trials. Overlaid
on the color-intensity plots (with scale as in Fig. 1) is a black curve representing (9).



• Scaling optimality of a trivial method. The optimal scaling with respect to $k$ and $n$ can be achieved
with a trivial maximum correlation (MC) method. In particular, both lasso and OMP, while likely to
do better, are not necessary to achieve this scaling.

• Dependence on SNR. While the threshold numbers of measurements for ML and MC sparsity recovery
to be successful have the same dependence on $n$ and $k$, the dependence on SNR differs significantly.
Specifically, the MC method requires a factor of up to $4(1 + \mathrm{SNR})$ more measurements than ML.
Moreover, as $\mathrm{SNR} \to \infty$, the number of measurements required by ML may be as low as $m = k + 1$.
In contrast, even letting $\mathrm{SNR} \to \infty$, the maximum correlation method still requires a scaling $m =
O(k \log(n - k))$.

• Lasso and dependence on MAR. MC can also be compared to lasso, at least in the high SNR regime.
There is a potential gap between MC and lasso, but the gap is smaller than the gap to ML. Specifically,
in the high SNR regime, MC requires at most a factor $4/\mathrm{MAR}$ more measurements than lasso, where MAR is
the minimum-to-average ratio defined in (5). Both lasso and MC scale as $m = O(k \log(n - k))$. Thus,
the benefit of lasso is not in its scaling with respect to the problem dimensions, but rather its ability
to detect the sparsity pattern even in the presence of relatively small nonzero coefficients (i.e. low
MAR).
While our results settle the question of the optimal scaling of the number of measurements $m$ in terms
of $k$ and $n$, there is clearly a gap in the necessary and sufficient conditions in terms of the scaling of
the SNR. We have seen that full ML estimation could potentially have a scaling in SNR as small as
$m = O(1/\mathrm{SNR}) + k - 1$. An open question is whether there is any practical algorithm that can achieve a
similar scaling.

A second open issue is to determine conditions for partial sparsity recovery. The above results define 
conditions for recovering all the positions in the sparsity pattern. However, in many practical applications, 
obtaining some large fraction of these positions would be sufficient. Neither the limits of partial sparsity 
recovery nor the performance of practical algorithms are completely understood, though some results have 
been reported in [20]-[22], [24]. 

Appendix 

A. Deterministic Necessary Condition 

The proof of Theorem 1 is based on the following deterministic necessary condition for sparsity recovery.
Recall the notation that for any $J \subset \{1, 2, \ldots, n\}$, $P_J$ denotes the orthogonal projection onto the span of
the vectors $\{a_j\}_{j \in J}$. Additionally, let $P_J^\perp = I - P_J$ denote the orthogonal projection onto the orthogonal
complement of $\mathrm{span}(\{a_j\}_{j \in J})$.

Lemma 1: A necessary condition for ML detection to succeed (i.e. $\hat{I}_{\rm ML} = I_{\rm true}$) is:

$$ \text{for all } i \in I_{\rm true} \text{ and } j \notin I_{\rm true}, \quad \frac{|a_i' P_K^\perp y|^2}{a_i' P_K^\perp a_i} \ge \frac{|a_j' P_K^\perp y|^2}{a_j' P_K^\perp a_j}, \tag{12} $$

where $K = I_{\rm true} \setminus \{i\}$.

Proof: Note that $y = P_K y + P_K^\perp y$ is an orthogonal decomposition of $y$ into the portions inside and
outside the subspace $S = \mathrm{span}(\{a_j\}_{j \in K})$. An approximation of $y$ in subspace $S$ leaves residual $P_K^\perp y$.
Intuitively, the condition (12) requires that the residual be at least as highly correlated with the remaining
"correct" vector $a_i$ as it is with any of the "incorrect" vectors $\{a_j\}_{j \notin I_{\rm true}}$.
Fix any $i \in I_{\rm true}$, $j \notin I_{\rm true}$ and let

$$ J = K \cup \{j\} = (I_{\rm true} \setminus \{i\}) \cup \{j\}. $$

That is, $J$ is equal to the true sparsity pattern $I_{\rm true}$, except that a single "correct" index $i$ has been replaced
by an "incorrect" index $j$. If the ML estimator is to select $\hat{I}_{\rm ML} = I_{\rm true}$ then the energy of the noisy vector
$y$ must be at least as large on the true subspace (indexed by $I_{\rm true}$) as on the incorrect subspace (indexed by $J$). Therefore,

$$ \|P_{I_{\rm true}} y\|^2 \ge \|P_J y\|^2. \tag{13} $$

Now, a simple application of the matrix inversion lemma shows that since $I_{\rm true} = K \cup \{i\}$,

$$ \|P_{I_{\rm true}} y\|^2 = \|P_K y\|^2 + \frac{|a_i' P_K^\perp y|^2}{a_i' P_K^\perp a_i}. \tag{14a} $$

Also, since $J = K \cup \{j\}$,

$$ \|P_J y\|^2 = \|P_K y\|^2 + \frac{|a_j' P_K^\perp y|^2}{a_j' P_K^\perp a_j}. \tag{14b} $$

Substituting (14a)-(14b) into (13) and cancelling $\|P_K y\|^2$ shows (12). ■
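
The identity (14a) is easy to sanity-check numerically on a random instance. The sketch below assumes NumPy; `proj_energy` and `residual_gain` are hypothetical helper names, and the dimensions are arbitrary small values chosen for illustration.

```python
import numpy as np

def proj_energy(A, J, y):
    """||P_J y||^2, computed via a least-squares fit on the columns in J."""
    cols = A[:, list(J)]
    coef, *_ = np.linalg.lstsq(cols, y, rcond=None)
    return float(np.sum((cols @ coef) ** 2))

def residual_gain(A, K, i, y):
    """|a_i' P_K^perp y|^2 / (a_i' P_K^perp a_i), the last term in (14a)."""
    cols = A[:, list(K)]
    # P_K = cols @ pinv(cols) is the orthogonal projector onto span of cols.
    P_perp = np.eye(A.shape[0]) - cols @ np.linalg.pinv(cols)
    a = A[:, i]
    return float((a @ P_perp @ y) ** 2 / (a @ P_perp @ a))

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 6))
y = rng.normal(size=10)
lhs = proj_energy(A, [0, 1, 2], y)                               # ||P_{K u {i}} y||^2
rhs = proj_energy(A, [0, 1], y) + residual_gain(A, [0, 1], 2, y)  # right side of (14a)
```

On generic inputs `lhs` and `rhs` agree to floating-point precision, which is the decomposition used in the proof.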
B. Proof of Theorem 1

To simplify notation, assume without loss of generality that $I_{\rm true} = \{1, 2, \ldots, k\}$. Also, assume that the
minimization in (5) occurs at $j = 1$ with

$$ |x_1|^2 = \frac{m}{k}\, \mathrm{SNR} \cdot \mathrm{MAR}. \tag{15} $$

Finally, since adding measurements (i.e. increasing $m$) can only improve the chances that ML detection
will work, we can assume that in addition to satisfying (6), the numbers of measurements satisfy the
lower bound

$$ m > \epsilon k \log(n - k) + k - 1, \tag{16} $$

for some $\epsilon > 0$. This assumption implies that

$$ \lim \frac{\log(n - k)}{m - k + 1} \le \lim \frac{1}{\epsilon k} = 0. \tag{17} $$

Here and in the remainder of the proof the limits are as $m$, $n$ and $k \to \infty$ subject to (6) and (16). With
these requirements on $m$, we need to show $\lim \Pr(\hat{I}_{\rm ML} = I_{\rm true}) = 0$.
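From Lemma 1, $\hat{I}_{\rm ML} = I_{\rm true}$ implies (12). Thus

$$ \Pr\left(\hat{I}_{\rm ML} = I_{\rm true}\right) \le \Pr\left( \frac{|a_1' P_K^\perp y|^2}{a_1' P_K^\perp a_1} \ge \frac{|a_j' P_K^\perp y|^2}{a_j' P_K^\perp a_j} \ \text{ for all } j \notin I_{\rm true} \right) = \Pr\left(\Delta^- \ge \Delta^+\right), $$

where

$$ \Delta^- = \frac{1}{\log(n - k)} \frac{|a_1' P_K^\perp y|^2}{a_1' P_K^\perp a_1}, \qquad \Delta^+ = \frac{1}{\log(n - k)} \max_{j \in \{k+1, \ldots, n\}} \frac{|a_j' P_K^\perp y|^2}{a_j' P_K^\perp a_j}, $$

and $K = I_{\rm true} \setminus \{1\} = \{2, \ldots, k\}$. The $-$ and $+$ superscripts are used to reflect that $\Delta^-$ is the energy lost
from removing "correct" index 1, and $\Delta^+$ is the energy added from adding the worst "incorrect" index.
The theorem will be proven if we can show that

$$ \limsup \Delta^- < \liminf \Delta^+ \tag{18} $$

with probability approaching one. We will consider the two limits separately.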

1) Limit of $\Delta^+$: Let $V_K$ be the $k - 1$ dimensional space spanned by the vectors $\{a_j\}_{j \in K}$. For each
$j \notin I_{\rm true}$, let $u_j$ be the unit vector

$$ u_j = P_K^\perp a_j / \|P_K^\perp a_j\|. $$

Since $a_j$ has i.i.d. Gaussian components, it is spherically symmetric. Also, if $j \notin K$, $a_j$ is independent
of the subspace $V_K$. Hence, in this case, $u_j$ will be a unit vector uniformly distributed on the unit sphere
in $V_K^\perp$. Since $V_K^\perp$ is an $m - k + 1$ dimensional subspace, it follows from Lemma 4 (see Appendix D)
that if we define

$$ z_j = |u_j' P_K^\perp y|^2 / \|P_K^\perp y\|^2, $$

then $z_j$ follows a $\mathrm{Beta}(1, m - k + 1)$ distribution. See Appendix D for a review of the chi-squared and
beta distributions and some simple results on these variables that will be used in the proofs below.
By the definition of $u_j$,

$$ \frac{|a_j' P_K^\perp y|^2}{a_j' P_K^\perp a_j} = |u_j' P_K^\perp y|^2 = z_j \|P_K^\perp y\|^2, $$

and therefore

$$ \Delta^+ = \frac{1}{\log(n - k)} \|P_K^\perp y\|^2 \max_{j \in \{k+1, \ldots, n\}} z_j. \tag{19} $$

Now the vectors $a_j$ are independent of one another, and for $j \notin I_{\rm true}$, each $a_j$ is independent of $P_K^\perp y$.
Therefore, the variables $z_j$ will be i.i.d. Hence, using Lemma 5 (see Appendix D) and (17),

$$ \lim \frac{m - k + 1}{\log(n - k)} \max_{j = k+1, \ldots, n} z_j = 2 \tag{20} $$

in distribution. Also,

$$ \liminf \frac{1}{m - k + 1} \|P_K^\perp y\|^2 \overset{(a)}{\ge} \lim \frac{1}{m - k + 1} \|P_{I_{\rm true}}^\perp y\|^2 \overset{(b)}{=} \lim \frac{1}{m - k + 1} \|P_{I_{\rm true}}^\perp d\|^2 \overset{(c)}{=} \lim \frac{m - k}{m - k + 1} = 1, \tag{21} $$

where (a) follows from the fact that $K \subset I_{\rm true}$ and hence $\|P_K^\perp y\| \ge \|P_{I_{\rm true}}^\perp y\|$; (b) is valid since $P_{I_{\rm true}}^\perp a_j = 0$
for all $j \in I_{\rm true}$ and therefore $P_{I_{\rm true}}^\perp A x = 0$; and (c) follows from the fact that $P_{I_{\rm true}}^\perp d$ is a unit-variance
white random vector in an $m - k$ dimensional space. Combining (19), (20) and (21) shows that

$$ \liminf \Delta^+ \ge 2. \tag{22} $$

2) Limit of $\Delta^-$: For any $j \in K$, $P_K^\perp a_j = 0$. Therefore,

$$ P_K^\perp y = P_K^\perp \Big( \sum_{j=1}^{k} a_j x_j + d \Big) = x_1 P_K^\perp a_1 + P_K^\perp d. $$

Hence,

$$ \frac{|a_1' P_K^\perp y|^2}{a_1' P_K^\perp a_1} = \Big| \|P_K^\perp a_1\|\, x_1 + v \Big|^2, $$

where $v$ is given by

$$ v = a_1' P_K^\perp d / \|P_K^\perp a_1\|. $$

Since $P_K^\perp a_1 / \|P_K^\perp a_1\|$ is a random unit vector independent of $d$, and $d$ is a zero-mean, unit-variance
Gaussian random vector, $v \sim \mathcal{N}(0, 1)$. Therefore,

$$ \lim \Delta^- = \lim \left| \frac{\|P_K^\perp a_1\|\, x_1}{\log^{1/2}(n - k)} + \frac{v}{\log^{1/2}(n - k)} \right|^2 = \lim \frac{\|P_K^\perp a_1\|^2 |x_1|^2}{\log(n - k)}, \tag{23} $$

where, in the last step, we used the fact that

$$ v / \log^{1/2}(n - k) \to 0. $$

Now, $a_1$ is a Gaussian vector with variance $1/m$ in each component and $P_K^\perp$ is a projection onto an
$m - k + 1$ dimensional space. Hence,

$$ \lim \frac{m \|P_K^\perp a_1\|^2}{m - k + 1} = 1. \tag{24} $$
Starting with a combination of (23) and (24),

$$ \limsup \Delta^- = \limsup \frac{|x_1|^2 (m - k + 1)}{m \log(n - k)} \overset{(a)}{=} \limsup \frac{(\mathrm{SNR} \cdot \mathrm{MAR})(m - k + 1)}{k \log(n - k)} \overset{(b)}{<} 2 - \delta, \tag{25} $$

where (a) uses (15); and (b) uses (6).

Comparing (22) and (25) proves (18), thus completing the proof. ■



C. Proof of Theorem 2

We will show that there exists a $\mu > 0$ such that, with high probability,

$$ |a_i' y|^2 > \mu \ \text{ for all } i \in I_{\rm true}; \qquad |a_j' y|^2 < \mu \ \text{ for all } j \notin I_{\rm true}. \tag{26} $$

When (26) is satisfied, $|a_i' y| > |a_j' y|$ for all indices $i \in I_{\rm true}$ and $j \notin I_{\rm true}$.
Thus, (26) implies that the maximum correlation estimator $\hat{I}_{\rm MC}$ in (8) will select $\hat{I}_{\rm MC} = I_{\rm true}$. Consequently,
the theorem will be proven if we can find a $\mu$ such that (26) holds with high probability.
Since $\delta > 0$, we can find an $\epsilon > 0$ such that

$$ \sqrt{8 + \delta} - \sqrt{2 + \epsilon} > \sqrt{2}. \tag{27} $$

Define

$$ \mu = (2 + \epsilon)(1 + \mathrm{SNR}) \log(n - k). \tag{28} $$

Define two probabilities corresponding to the two conditions in (26):

$$ p_{\rm MD} = \Pr\left( |a_i' y|^2 < \mu \text{ for some } i \in I_{\rm true} \right) \tag{29} $$

$$ p_{\rm FA} = \Pr\left( |a_j' y|^2 > \mu \text{ for some } j \notin I_{\rm true} \right). \tag{30} $$

The first probability $p_{\rm MD}$ is the probability of missed detection, i.e., the probability that the energy on one
of the "true" vectors, $a_i$ with $i \in I_{\rm true}$, is below the threshold $\mu$. The second probability $p_{\rm FA}$ is the false
alarm probability, i.e., the probability that the energy on one of the "incorrect" vectors, $a_j$ with $j \notin I_{\rm true}$,
is above the threshold $\mu$. Since the correlation estimator detects the correct sparsity pattern when there
are no missed vectors or false alarms, we have the bound

$$ \Pr\left( \hat{I}_{\rm MC} \ne I_{\rm true} \right) \le p_{\rm MD} + p_{\rm FA}. $$

So the result will be proven if we can show that $p_{\rm MD}$ and $p_{\rm FA}$ approach zero as $m$, $n$ and $k \to \infty$
satisfying (9). We analyze these two probabilities separately.



FLETCHER, RANGAN AND GOYAL 13 

1) Limit of $p_{\rm FA}$: Consider any index $j \notin I_{\rm true}$. Since $y$ is a linear combination of vectors $\{a_i\}_{i \in I_{\rm true}}$
and the noise vector $d$, $a_j$ is independent of $y$. Also, recall that the components of $a_j$ are $\mathcal{N}(0, 1/m)$.
Therefore, conditional on $\|y\|^2$, the inner product $a_j' y$ is Gaussian with mean zero and variance $\|y\|^2/m$.
For large $m$, $\|y\|^2/m \to 1 + \mathrm{SNR}$. Hence, we can write

$$ |a_j' y|^2 = (1 + \mathrm{SNR})\, u_j^2, $$

where $u_j$ is a random variable that converges in distribution to a zero mean Gaussian with unit variance.

Using the definitions of $p_{\rm FA}$ in (30) and $\mu$ in (28), we see that

$$ p_{\rm FA} = \Pr\Big( \max_{j \notin I_{\rm true}} |a_j' y|^2 > \mu \Big) = \Pr\Big( \max_{j \notin I_{\rm true}} (1 + \mathrm{SNR})\, u_j^2 > \mu \Big) = \Pr\Big( \max_{j \notin I_{\rm true}} u_j^2 > \mu/(1 + \mathrm{SNR}) \Big) = \Pr\Big( \max_{j \notin I_{\rm true}} u_j^2 > (2 + \epsilon) \log(n - k) \Big) \to 0, $$

where the last limit uses Lemma 3 (see Appendix D) on the maxima of chi-squared random variables.
2) Limit of $p_{\mathrm{MD}}$: Consider any index $i \in I_{\mathrm{true}}$. Observe that

$$a_i'y = \|a_i\|^2 x_i + a_i'e_i,$$

where

$$e_i = y - a_i x_i = \sum_{\ell \neq i} a_\ell x_\ell + d.$$

It is easily verified that $\|a_i\|^2 \to 1$ and $\|e_i\|^2/m \to 1 + \mathrm{SNR}$. Using a similar argument as above, one can show that

$$a_i'y = x_i + (1+\mathrm{SNR})^{1/2}\,u_i, \tag{31}$$

where $u_i$ approaches a zero-mean, unit-variance Gaussian in distribution.
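Now, using the definitions of MAR and SNR and the assumed lower bound on $m$,

$$|x_i|^2 \geq \frac{\mathrm{MAR}\,\|x\|^2}{k} = \frac{m\,\mathrm{MAR}\cdot\mathrm{SNR}}{k} > (8+\delta)(1+\mathrm{SNR})\log(n-k). \tag{32}$$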

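Combining (27), (28), (31) and (32),

$$\begin{aligned}
|a_i'y|^2 < \mu &\iff \left|x_i + (1+\mathrm{SNR})^{1/2}u_i\right| < \mu^{1/2} \\
&\implies (1+\mathrm{SNR})\,u_i^2 > \left(|x_i| - \mu^{1/2}\right)^2 \\
&\implies u_i^2 > 2\log(n-k) \\
&\implies u_i^2 > 2\log(k),
\end{aligned}$$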

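where, in the last step, we have used the fact that since $k/n < 1/2$, $n - k > k$. Therefore, using Lemma 3,

$$p_{\mathrm{MD}} = \Pr\left(\min_{i \in I_{\mathrm{true}}} |a_i'y|^2 < \mu\right) \leq \Pr\left(\max_{i \in I_{\mathrm{true}}} u_i^2 > 2\log(k)\right) \to 0. \tag{33}$$

Hence, we have shown both $p_{\mathrm{FA}} \to 0$ and $p_{\mathrm{MD}} \to 0$ as $n \to \infty$, and the theorem is proven.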
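The argument can be explored empirically with a small Monte Carlo sketch. Several things here are assumptions made for illustration and are not taken from the text above: the estimator is taken to select the $k$ columns with the largest $|a_j'y|^2$, the noise $d$ has i.i.d. $\mathcal{N}(0,1)$ entries, all nonzero entries of $x$ share one amplitude (so that $\mathrm{MAR} = 1$), and the function names are hypothetical.

```python
import math
import random
from typing import Sequence, Set


def max_correlation_support(columns: Sequence[Sequence[float]],
                            y: Sequence[float], k: int) -> Set[int]:
    """Indices of the k columns with the largest |a_j'y|^2."""
    energies = [sum(a * b for a, b in zip(col, y)) ** 2 for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: energies[j], reverse=True)
    return set(ranked[:k])


def simulate_once(n: int, k: int, m: int, amplitude: float,
                  rng: random.Random) -> bool:
    """Draw one instance of y = A x + d; report whether the support is recovered."""
    # Columns a_j with i.i.d. N(0, 1/m) entries, matching the proof above.
    columns = [[rng.gauss(0.0, 1.0 / math.sqrt(m)) for _ in range(m)]
               for _ in range(n)]
    true_support = set(rng.sample(range(n), k))
    # Start from the noise vector d (assumed N(0, 1) entries), then add A x.
    y = [rng.gauss(0.0, 1.0) for _ in range(m)]
    for j in true_support:
        for t in range(m):
            y[t] += amplitude * columns[j][t]
    return max_correlation_support(columns, y, k) == true_support


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 20
    hits = sum(simulate_once(n=50, k=3, m=400, amplitude=20.0, rng=rng)
               for _ in range(trials))
    print(f"recovered the support in {hits} of {trials} trials")
```

With these settings $\mathrm{SNR} = k \cdot \text{amplitude}^2 / m = 3$. Varying `m` while holding the other parameters fixed gives a rough picture of where recovery starts to fail.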
D. Maxima of Chi-Squared and Beta Random Variables

The proofs of the main results above require a few simple results on the maxima of large numbers of 
chi-squared and beta random variables. A complete description of chi-squared and beta random variables 
can be found in [25]. 

A random variable $U$ has a chi-squared distribution with $r$ degrees of freedom if it can be written as

$$U = \sum_{i=1}^{r} Z_i^2,$$

where $Z_i$ are i.i.d. $\mathcal{N}(0, 1)$. For every $n$ and $r$ define the random variables

$$\overline{M}_{n,r} = \max_{i \in \{1,\ldots,n\}} U_i, \qquad \underline{M}_{n,r} = \min_{i \in \{1,\ldots,n\}} U_i,$$

where the $U_i$'s are i.i.d. chi-squared with $r$ degrees of freedom.
Lemma 2: For $\overline{M}_{n,1}$ defined as above,

$$\lim_{n \to \infty} \frac{1}{\log(n)}\,\overline{M}_{n,1} = 2,$$

where the convergence is in distribution.

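Proof: We can write $\overline{M}_{n,1} = \max_{i \in \{1,\ldots,n\}} Z_i^2$ where $Z_i$ are i.i.d. $\mathcal{N}(0, 1)$. Then, for any $a > 0$,

$$\begin{aligned}
\Pr\left(\frac{1}{\log(n)}\,\overline{M}_{n,1} < a\right) &= \Pr\left(|Z_i|^2 < a\log(n)\right)^n \\
&= \mathrm{erf}\left(\sqrt{a\log(n)/2}\right)^n \\
&\approx \left[1 - \sqrt{\frac{2}{\pi a \log(n)}}\,\exp\left(-a\log(n)/2\right)\right]^n \\
&= \left[1 - \sqrt{\frac{2}{\pi a \log(n)}}\,\frac{1}{n^{a/2}}\right]^n,
\end{aligned}$$

where the approximation is valid for large $n$. Taking the limit as $n \to \infty$, one can now easily show that

$$\lim_{n \to \infty} \Pr\left(\frac{1}{\log(n)}\,\overline{M}_{n,1} < a\right) = \begin{cases} 0, & \text{for } a < 2; \\ 1, & \text{for } a > 2, \end{cases}$$

and therefore $\overline{M}_{n,1}/\log(n) \to 2$ in distribution.

Lemma 3: In any limit where $r \to \infty$ and $\log(n)/r \to 0$,

$$\lim \frac{1}{r}\,\overline{M}_{n,r} = \lim \frac{1}{r}\,\underline{M}_{n,r} = 1,$$

where the convergence is in distribution.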
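Proof: It suffices to show

$$\limsup \frac{1}{r}\,\overline{M}_{n,r} \leq 1, \qquad \liminf \frac{1}{r}\,\underline{M}_{n,r} \geq 1.$$

We will just prove the first inequality since the proof of the second is similar. We can write

$$\frac{1}{r}\,\overline{M}_{n,r} = \max_{i=1,\ldots,n} V_i,$$

where each $V_i = U_i/r$ and the $U_i$'s are i.i.d. chi-squared random variables with $r$ degrees of freedom. Using the moment generating function of $U_i$ and Chebyshev's inequality in its exponential (Chernoff) form, one can show that for all $\epsilon > 0$,

$$\Pr\left(V_i > 1+\epsilon\right) = \Pr\left(U_i > (1+\epsilon)r\right) \leq \left[(1+\epsilon)e^{-\epsilon}\right]^{r/2} = e^{-c_\epsilon r}, \qquad c_\epsilon = \tfrac{1}{2}\left[\epsilon - \log(1+\epsilon)\right] > 0.$$

Therefore,

$$\begin{aligned}
\Pr\left(\frac{1}{r}\,\overline{M}_{n,r} < 1+\epsilon\right) &= \left[\Pr\left(V_i < 1+\epsilon\right)\right]^n \\
&\geq \left[1 - e^{-c_\epsilon r}\right]^n \\
&\geq 1 - n\,e^{-c_\epsilon r} \\
&= 1 - \exp\left[\log(n) - c_\epsilon r\right] \\
&\to 1,
\end{aligned}$$

where the limit in the last step follows from the fact that $\log(n)/r \to 0$. Since this is true for all $\epsilon > 0$, it follows that $\limsup r^{-1}\overline{M}_{n,r} \leq 1$. Similarly, one can show $\liminf r^{-1}\underline{M}_{n,r} \geq 1$ and this proves the lemma. ■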
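Both chi-squared limits can be checked numerically. The sketch below is an illustration with hypothetical function names. For one degree of freedom it evaluates the exact distribution function $\Pr(\max_i Z_i^2/\log(n) < a) = \mathrm{erf}(\sqrt{a\log(n)/2})^n$ that appears in the proof of Lemma 2. For many degrees of freedom it draws samples, using the standard fact that a chi-squared variable with $r$ degrees of freedom is a Gamma variable with shape $r/2$ and scale 2.

```python
import math
import random
from typing import Tuple


def scaled_max_cdf(n: int, a: float) -> float:
    """Exact Pr(max_i Z_i^2 / log(n) < a) for n i.i.d. N(0, 1) variables Z_i."""
    return math.erf(math.sqrt(a * math.log(n) / 2.0)) ** n


def scaled_extremes(n: int, r: int, rng: random.Random) -> Tuple[float, float]:
    """Return (max_i U_i / r, min_i U_i / r) over n i.i.d. chi-squared(r) draws."""
    # chi-squared with r degrees of freedom == Gamma(shape = r / 2, scale = 2).
    draws = [rng.gammavariate(r / 2.0, 2.0) for _ in range(n)]
    return max(draws) / r, min(draws) / r


if __name__ == "__main__":
    # One degree of freedom: the CDF drifts toward 0 for a < 2 and 1 for a > 2.
    for n in (10**2, 10**4, 10**6):
        print(n, scaled_max_cdf(n, 1.5), scaled_max_cdf(n, 2.5))
    # Many degrees of freedom with log(n) / r shrinking: both ratios approach 1.
    rng = random.Random(0)
    for n, r in ((100, 100), (200, 2000), (400, 40000)):
        print(n, r, scaled_extremes(n, r, rng))
```

The convergence in the one-degree-of-freedom case is slow, since the exponent in the proof of Lemma 2 only grows like $n^{1-a/2}/\sqrt{\log(n)}$.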

The next two lemmas concern certain beta distributed random variables. A real-valued scalar random 
variable W follows a Beta(r, s) distribution if it can be written as 

$$W = \frac{U_r}{U_r + V_s},$$

where the variables $U_r$ and $V_s$ are independent chi-squared random variables with $r$ and $s$ degrees of freedom, respectively. The importance of the beta distribution is given by the following lemma.

Lemma 4: Let $x$ and $u \in \mathbb{R}^s$ be any two independent random vectors, with $u$ being uniformly distributed on the unit sphere. Let $w = |u'x|^2/\|x\|^2$ be the energy of $x$ projected onto $u$. Then $w$ is independent of $x$ and follows a Beta$(1, s-1)$ distribution.

Proof: This can be proven along the lines of the arguments in [9]. ■ 
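A simple sanity check on Lemma 4 is that the mean of $w$ must equal $1/s$: by rotational symmetry one may take $x$ along the first coordinate axis, so that $w = u_1^2$, and the squared coordinates of a unit vector sum to one. The sketch below, with hypothetical function names, estimates this mean by Monte Carlo. It draws $u$ by normalizing a standard Gaussian vector, which is a standard way to sample uniformly from the unit sphere.

```python
import math
import random
from typing import Sequence


def projected_energy(x: Sequence[float], rng: random.Random) -> float:
    """Return w = |u'x|^2 / ||x||^2 for u drawn uniformly from the unit sphere."""
    g = [rng.gauss(0.0, 1.0) for _ in x]           # Gaussian vector ...
    norm_g = math.sqrt(sum(v * v for v in g))      # ... normalized to unit length
    inner = sum(gi * xi for gi, xi in zip(g, x)) / norm_g
    return inner * inner / sum(xi * xi for xi in x)


def mean_projected_energy(x: Sequence[float], trials: int,
                          rng: random.Random) -> float:
    """Monte Carlo estimate of E[w] for a fixed vector x."""
    return sum(projected_energy(x, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    x = [3.0, -1.0, 2.0, 0.5, 1.0, 0.0, 4.0, -2.0, 1.5, 0.25]  # s = 10
    print(mean_projected_energy(x, trials=10000, rng=random.Random(0)), 1.0 / len(x))
```

The estimate does not depend on the particular choice of `x`, which is consistent with the independence claimed in the lemma.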

The following lemma provides a simple expression for the maxima of certain beta distributed variables. 

Lemma 5: Given any $s$ and $n$, let $w_{j,s}$, $j = 1, \ldots, n$, be i.i.d. Beta$(1, s-1)$ random variables and define

$$T_{n,s} = \max_{j=1,\ldots,n} w_{j,s}.$$

Then for any limit with $n$ and $s \to \infty$ and $\log(n)/s \to 0$,

$$\lim \frac{s}{\log(n)}\,T_{n,s} = 2,$$

where the convergence is in distribution.

Proof: We can write $w_{j,s} = u_j/(u_j + v_{j,s-1})$ where $u_j$ and $v_{j,s-1}$ are independent chi-squared random variables with 1 and $s - 1$ degrees of freedom, respectively. Let

$$M_n = \max_{j \in \{1,\ldots,n\}} u_j, \qquad \overline{M}_{n,s-1} = \max_{j \in \{1,\ldots,n\}} v_{j,s-1}, \qquad \underline{M}_{n,s-1} = \min_{j \in \{1,\ldots,n\}} v_{j,s-1}.$$

Using the definition of $T_{n,s}$,

$$T_{n,s} \leq \frac{M_n}{M_n + \underline{M}_{n,s-1}}.$$

Now Lemmas 2 and 3 and the hypothesis of this lemma show that $M_n/\log(n) \to 2$, $\underline{M}_{n,s-1}/(s-1) \to 1$, and $\log(n)/s \to 0$. One can combine these limits to show that

$$\limsup_{n,s \to \infty} \frac{s}{\log(n)}\,T_{n,s} \leq 2.$$

Similarly, one can show that

$$\liminf_{n,s \to \infty} \frac{s}{\log(n)}\,T_{n,s} \geq 2,$$

and therefore $sT_{n,s}/\log(n) \to 2$.
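The scaling in Lemma 5 can also be probed by simulation. The sketch below is illustrative, with a hypothetical function name. It follows the construction used in the proof: each variable is formed as $u/(u+v)$ with $u$ chi-squared with 1 degree of freedom and $v$ chi-squared with $s - 1$ degrees of freedom, the latter drawn as a Gamma variable with shape $(s-1)/2$ and scale 2.

```python
import math
import random


def scaled_beta_max(n: int, s: int, rng: random.Random) -> float:
    """Return s * T_{n,s} / log(n), where T_{n,s} is the max of n ratios u / (u + v)."""
    t_max = 0.0
    for _ in range(n):
        u = rng.gauss(0.0, 1.0) ** 2                  # chi-squared, 1 degree of freedom
        v = rng.gammavariate((s - 1) / 2.0, 2.0)      # chi-squared, s - 1 degrees of freedom
        t_max = max(t_max, u / (u + v))
    return s * t_max / math.log(n)


if __name__ == "__main__":
    rng = random.Random(0)
    for n, s in ((200, 500), (2000, 5000), (20000, 50000)):
        print(n, s, scaled_beta_max(n, s, rng))
```

At moderate $n$ the printed ratios typically sit somewhat below 2, because the maximum of $n$ squared Gaussians approaches $2\log(n)$ only logarithmically slowly.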